Red Wine Quality by Juho Salminen

Univariate Plots Section

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

The dataset consists of 1599 observations of 13 variables. Variable X appears to be just a row id, which leaves 12 variables of interest. Quality of wines is measured by integers between 3 and 8. I suppose the full scale is from 1 to 10, but for some reason extreme values have not been used. Other variables are continuous measures of physical and chemical properties of the wine.

Quality

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
## 
##    3    4    5    6    7    8 
##  0.6  3.3 42.6 39.9 12.4  1.1

Distribution of wine qualities is bell-shaped with median 6 and mean 5.636. The left tail appears longer, but the right tail is heavier. It might make sense to combine categories, as some of them have only a few observations.

## 
##  Low High 
## 1382  217

It might be easier to work with only two categories of wines instead of the full range of evaluations. The buyers are likely more interested in whether a wine is worth buying or not instead of exact ratings.
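The two tables above can be reproduced from the quality counts alone. The sketch below rebuilds the quality column from the printed counts and derives a two-level factor from it. The name quality.bin and the cut-off at 7 follow the description later in this report; the code that produced the original output is not shown, so this is only one way it might have been done.

```r
# Rebuild the quality column from the counts printed above
# (a stand-in for wine$quality, which is not loaded here).
counts  <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
quality <- rep(as.integer(names(counts)), times = counts)

# Share of each score in percent, rounded to one decimal
quality_pct <- round(100 * prop.table(table(quality)), 1)

# Two-level factor: scores 7 and 8 are "High", everything else is "Low"
quality.bin <- factor(ifelse(quality >= 7, "High", "Low"),
                      levels = c("Low", "High"))
bin_counts <- table(quality.bin)

print(quality_pct)
print(bin_counts)
```

Fixing the factor levels explicitly keeps "Low" first in every later table and plot, regardless of alphabetical order.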

Acidity

Fixed acidity of the wines is concentrated around the value 8, with some skew to the right. It will be interesting to see whether the best wines have the highest acidity. In that case they would be easy to identify.

Volatile acidity is much lower than fixed acidity in absolute terms. The distribution appears to have low variance with a few outliers to the right.

## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

Increasing the resolution (narrower bins) reveals an interesting gap in the middle of the distribution. Why is this?

Many wines have zero or very little citric acid. Otherwise the distribution is quite flat until it starts to decrease around 0.5. There is a curious spike at this value, and a couple of less distinct ones at lower values. It looks as if the wine makers might be aiming for citric acid levels of zero, 0.25 or 0.5. Maybe these spikes indicate different types of wines?

Residual sugar

Most wines have a low amount of residual sugar, between about 1 and 3.5. Some examples have much higher amounts of residual sugar. They are perhaps of a different type, like dessert wines? It will be interesting to see whether the outliers are of high or low quality.

Chlorides

The distribution of chlorides resembles that of residual sugar. Most values are tightly concentrated around 0.08 with a thin and long right tail all the way to 0.6. It looks as if there is a small concentration of wines around 0.4. Is this a distinct subtype or category of wines, or just an artefact in the data? Again, I’m interested to see if this group sticks out as having low or high quality.

Sulfur dioxide

Most wines have lowish amounts of free sulfur dioxide. The distribution is again right-skewed.

Same story here as with free sulfur dioxide, but with values roughly three times higher (mean 46.5 versus 15.9). I wonder what the relationship between the free and total amounts of sulfur dioxide is?

There is a slightly increasing trend in additional sulfur dioxide as the amount of free sulfur dioxide increases. It is still quite common for most of the total sulfur dioxide to be accounted for by free sulfur dioxide. I’m creating a new variable fixed.sulfur.dioxide by calculating the difference between the two measures.
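The derived variable is a plain column difference. As a small illustration, the sketch below applies it to the first six wines from the data preview at the top of the report; the data frame name wine_head is made up for this example.

```r
# First six wines, values copied from the data preview above
wine_head <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40)
)

# Sulfur dioxide that is not free: total minus free
wine_head$fixed.sulfur.dioxide <-
  wine_head$total.sulfur.dioxide - wine_head$free.sulfur.dioxide

print(wine_head)
```

Because the new column is built from total sulfur dioxide, a high correlation between the two is expected by construction.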

Similar distribution as with total sulfur dioxide.

Density

Density is almost normally distributed around a value a little less than 1, which makes sense as wine is mostly water, and alcohol is less dense than water. Density might actually be correlated with the amount of alcohol.

There indeed is a downward trend with increasing alcohol levels. Stronger wines tend to be less dense.

pH

pH of the wines is almost normally distributed around 3.3. Wines are acidic.

Sulphates

A relatively tight distribution with some skew and outliers to the right.

Alcohol

Wines typically have at least 9 % alcohol, with around 10 % being the average and the number of wines slowly decreasing as the alcohol content increases. Wine makers seem to prefer round numbers in alcohol content. There are spikes in the distribution around every .0 and .5.
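One way to check the preference for round numbers without a plot is to flag values that end in .0 or .5. The sketch below does this for the first ten alcohol values shown in the structure output at the top of the report. Ten values are far too few to confirm the spikes; on the full column, the share of flagged wines could be compared with the 20 % expected if the final digit were spread evenly.

```r
# First ten alcohol values, copied from the str() output at the top
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Work in tenths of a percent to avoid floating-point trouble with %%
tenths <- round(alcohol * 10)

# TRUE when the value ends in .0 or .5
is_round <- tenths %% 5 == 0

print(table(is_round))
```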

Univariate Analysis

What is the structure of your dataset?

The red wine dataset consists of 1599 observations of 12 variables (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality), plus the id column X. Quality is an ordered categorical variable with observed values from 3 to 8, larger values being better. Other variables are continuous.

Most wines (82.5 %) have a quality rating of 5 or 6. 7 is the third most common rating (12.4 %), while all the other quality scores together cover only 5 % of the wines. Red wine is acidic (pH 2.7-4.0) and usually has only a little residual sugar. Mean alcohol content of the wines is 10.4 %.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is quality. I’d like to be able to classify wines to high (quality 7 or 8) and low quality (quality 6 or lower) categories based on some combination of physical measures.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Based on the shapes of the distributions, volatile acidity, citric acid, chlorides and alcohol seem promising. The alcohol and citric acid distributions in particular feature curious spikes at round values, suggesting the wine makers might be aiming for specific values of these features, which in turn implies the wine makers believe those features have something to do with the quality of the wine.

Did you create any new variables from existing variables in the dataset?

I created the variable fixed.sulfur.dioxide by subtracting free sulfur dioxide from total sulfur dioxide. I also combined quality categories into a new binary variable quality.bin. In this variable ‘high’ is assigned to wines with quality 7 or 8 and ‘low’ is assigned to all other wines.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Several of the distributions were right-skewed. I tried a few transformations (logarithmic, cube root, power) on some of them, but the shapes of the distributions did not improve. In the end I used the features as they are. (With hindsight, the classification models could have benefited from normalization.)
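To make the comparison of transformations concrete, the sketch below measures sample skewness before and after a log and a cube-root transformation. It runs on simulated right-skewed data, not on the wine measurements, so the numbers only illustrate the procedure; the helper skewness() and the simulation parameters are assumptions made for this example.

```r
# Simulated right-skewed stand-in for a variable such as residual sugar;
# these are NOT the wine measurements.
set.seed(1)
x <- rlnorm(1599, meanlog = 0.8, sdlog = 0.5)

# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew <- c(raw       = skewness(x),
          log10     = skewness(log10(x)),
          cube_root = skewness(x^(1/3)))

print(round(skew, 2))
```

A value close to zero indicates a roughly symmetric distribution. On the real features the transformations may help far less than on this idealized example, which matches the experience described above.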

Bivariate Plots Section

Correlations and summaries by quality

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
## fixed.sulfur.dioxide   -0.07814929      0.097033939  0.06677604
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
## fixed.sulfur.dioxide    0.174529035  0.055479649         0.425148917
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
## fixed.sulfur.dioxide           0.95768634  0.09513464 -0.10805328
##                         sulphates     alcohol     quality
## fixed.acidity         0.183005664 -0.06166827  0.12405165
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778
## citric.acid           0.312770044  0.10990325  0.22637251
## residual.sugar        0.005527121  0.04207544  0.01373164
## chlorides             0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
## density               0.148506412 -0.49617977 -0.17491923
## pH                   -0.196647602  0.20563251 -0.05773139
## sulphates             1.000000000  0.09359475  0.25139708
## alcohol               0.093594750  1.00000000  0.47616632
## quality               0.251397079  0.47616632  1.00000000
## fixed.sulfur.dioxide  0.032244043 -0.22320257 -0.20546298
##                      fixed.sulfur.dioxide
## fixed.acidity                 -0.07814929
## volatile.acidity               0.09703394
## citric.acid                    0.06677604
## residual.sugar                 0.17452903
## chlorides                      0.05547965
## free.sulfur.dioxide            0.42514892
## total.sulfur.dioxide           0.95768634
## density                        0.09513464
## pH                            -0.10805328
## sulphates                      0.03224404
## alcohol                       -0.22320257
## quality                       -0.20546298
## fixed.sulfur.dioxide           1.00000000
## wine[, 14]: Low
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 4.600   Min.   :0.160    Min.   :0.0000   Min.   : 0.900  
##  1st Qu.: 7.100   1st Qu.:0.420    1st Qu.:0.0825   1st Qu.: 1.900  
##  Median : 7.800   Median :0.540    Median :0.2400   Median : 2.200  
##  Mean   : 8.237   Mean   :0.547    Mean   :0.2544   Mean   : 2.512  
##  3rd Qu.: 9.100   3rd Qu.:0.650    3rd Qu.:0.4000   3rd Qu.: 2.600  
##  Max.   :15.900   Max.   :1.580    Max.   :1.0000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.03400   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07100   1st Qu.: 8.00       1st Qu.: 23.00      
##  Median :0.08000   Median :14.00       Median : 39.50      
##  Mean   :0.08928   Mean   :16.17       Mean   : 48.29      
##  3rd Qu.:0.09100   3rd Qu.:22.00       3rd Qu.: 65.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :165.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9958   1st Qu.:3.210   1st Qu.:0.5400   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6000   Median :10.00  
##  Mean   :0.9969   Mean   :3.315   Mean   :0.6448   Mean   :10.25  
##  3rd Qu.:0.9979   3rd Qu.:3.410   3rd Qu.:0.7000   3rd Qu.:10.90  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##  fixed.sulfur.dioxide
##  Min.   :  3.00      
##  1st Qu.: 12.00      
##  Median : 23.00      
##  Mean   : 32.11      
##  3rd Qu.: 42.00      
##  Max.   :128.00      
## -------------------------------------------------------- 
## wine[, 14]: High
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar 
##  Min.   : 4.900   Min.   :0.1200   Min.   :0.0000   Min.   :1.200  
##  1st Qu.: 7.400   1st Qu.:0.3000   1st Qu.:0.3000   1st Qu.:2.000  
##  Median : 8.700   Median :0.3700   Median :0.4000   Median :2.300  
##  Mean   : 8.847   Mean   :0.4055   Mean   :0.3765   Mean   :2.709  
##  3rd Qu.:10.100   3rd Qu.:0.4900   3rd Qu.:0.4900   3rd Qu.:2.700  
##  Max.   :15.600   Max.   :0.9150   Max.   :0.7600   Max.   :8.900  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 3.00       Min.   :  7.00      
##  1st Qu.:0.06200   1st Qu.: 6.00       1st Qu.: 17.00      
##  Median :0.07300   Median :11.00       Median : 27.00      
##  Mean   :0.07591   Mean   :13.98       Mean   : 34.89      
##  3rd Qu.:0.08500   3rd Qu.:18.00       3rd Qu.: 43.00      
##  Max.   :0.35800   Max.   :54.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9906   Min.   :2.880   Min.   :0.3900   Min.   : 9.20  
##  1st Qu.:0.9947   1st Qu.:3.200   1st Qu.:0.6500   1st Qu.:10.80  
##  Median :0.9957   Median :3.270   Median :0.7400   Median :11.60  
##  Mean   :0.9960   Mean   :3.289   Mean   :0.7435   Mean   :11.52  
##  3rd Qu.:0.9973   3rd Qu.:3.380   3rd Qu.:0.8200   3rd Qu.:12.20  
##  Max.   :1.0032   Max.   :3.780   Max.   :1.3600   Max.   :14.00  
##  fixed.sulfur.dioxide
##  Min.   :  4.00      
##  1st Qu.:  9.00      
##  Median : 14.00      
##  Mean   : 20.91      
##  3rd Qu.: 22.00      
##  Max.   :251.50

Volatile acidity, citric acid, sulphates and alcohol have moderate correlations with wine quality. These features also have noticeably different means between low and high quality wines.
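The ranking behind this observation can be read off the quality column of the correlation matrix. The sketch below copies those values (rounded to three decimals) into a named vector and sorts them by absolute size. On the full data the matrix itself would typically come from cor() on the numeric columns, and the "wine[, 14]" headers suggest the grouped summaries came from by(); neither call is shown in the report, so treat both as assumptions.

```r
# Correlations with quality, copied (rounded) from the matrix above
cor_quality <- c(
  fixed.acidity        =  0.124,
  volatile.acidity     = -0.391,
  citric.acid          =  0.226,
  residual.sugar       =  0.014,
  chlorides            = -0.129,
  free.sulfur.dioxide  = -0.051,
  total.sulfur.dioxide = -0.185,
  density              = -0.175,
  pH                   = -0.058,
  sulphates            =  0.251,
  alcohol              =  0.476,
  fixed.sulfur.dioxide = -0.205
)

# Rank features by the absolute size of the correlation
ranked <- cor_quality[order(abs(cor_quality), decreasing = TRUE)]
print(ranked)

# The four features most strongly correlated with quality
top4 <- names(ranked)[1:4]
print(top4)
```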

Quality by fixed acidity

No clear trends here. Poor and good wines seem to have higher fixed acidity, but on the other hand there are only a few data points on them, so the effect does not feel very trustworthy.

Comparing only two quality categories reveals that actually high quality wines tend to have higher fixed acidity. Combining quality categories is starting to look like a good idea.

Quality by volatile acidity

There is a clear decreasing trend with volatile acidity when the wine quality increases.

As quality increases, the distribution of volatile acidity shifts to the left and gets narrower.

The lower the volatile acidity, the more likely the wine is to be of high quality. Looks like about 0.38 volatile acidity is the sweet spot for red wines.

Quality by citric acid

The very best wines tend to have higher amounts of citric acid.

Interesting! On average good wines tend to have a lot of citric acid, but the density plot reveals the picture is more complex. There seem to be three kinds of wines regarding citric acid: low (close to 0), medium (~0.25) and high (~0.4) amounts of citric acid. Good wines have either a little or a lot of citric acid, while other wines can have any amount of it.

Quality by residual sugar

Nothing interesting going on here.

Quality by chlorides

Move along, nothing to see here.

Quality by sulfur dioxide

Average quality wines seem to have a little more free sulfur dioxide on average, but this does not help much in differentiating high quality wines from others.

Total sulfur dioxide is a better indicator of whether a wine is good or bad.

Total sulfur dioxide by free sulfur dioxide

High quality wines seem to lie along a line where the amount of total sulfur dioxide compared to free sulfur dioxide is low.

The pattern is not very clear, though.

Getting better…

Fixed sulfur dioxide is an even better discriminator than total sulfur dioxide! Good wines have low amounts of fixed sulfur dioxide. There are two extreme outliers.

Looks like a low amount of fixed sulfur dioxide is a prerequisite but not a guarantee for high wine quality.

Quality by density

Higher quality wines seem to have lower density. They also have more alcohol, which could cause the correlation. It is probably a good idea to explore how different things affect the density of the wine.

Quality by pH

Better wines tend to have slightly lower pH, perhaps because better wines often have higher fixed acidity. The pattern is not very clear though.

Quality by sulphates

Higher amounts of sulphates are associated with higher quality, but there are many outliers in average quality wines that muddy the relationship.

Quality by alcohol

The pattern with alcohol is a little bit U-shaped. The worst quality wines tend to have more alcohol than average wines, and then the better than average wines have increasing amounts of alcohol.

With lower resolution the pattern becomes clearer. Better wines tend to have higher amounts of alcohol.

Wines with more than 12 % alcohol are likely to have high quality, and wines with less than 10 % alcohol are likely poor.

Composition of density

Many of the features are associated with density, which makes sense. Fixed acidity and alcohol seem to have the strongest association. pH has strong correlations with acidity measures, so its association with density is likely to be a result of that.

Composition of pH and acidity

The higher the acidity, the lower the pH. Surprisingly, higher volatile acidity has a weak correlation with higher pH. Maybe volatile acids are “escaping” from the wine?

Fixed and volatile acidity do not have much to do with each other, but citric acid has to do with both of them! Citric acid has positive correlation with fixed acidity and negative correlation with volatile acidity. These relationships look somewhat nonlinear.

Looks like fixed acidity is linearly related to citric acid to some power, perhaps 4.

Here the relationship looks most linear when citric acid values are squared, but the approximation is very rough.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Merging quality categories to just high (7 or 8) and low (6 or below) turned out to be helpful in clarifying the differences between wines. In summary, high quality wines tend to have relatively:

  • low volatile acidity (around 0.4)
  • either low or high amounts of citric acid (around 0.1 or 0.5)
  • low amounts of fixed sulfur dioxide
  • often highish amounts of sulphates
  • on average 11.5 % alcohol compared to 10.25 % in lower quality wines

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

In addition to relationships between physical measures and quality I investigated the composition of acidity in more detail, because two acidity measures correlated with quality, and they interact with each other. Interestingly volatile acidity has negative correlation and fixed acidity has positive correlation with citric acid, but volatile and fixed acidity do not correlate much with each other. The relationships seem to be linear in some power of citric acid, perhaps around 2 (volatile acidity) and 4 (fixed acidity). I also looked at the composition of density, which is likely a result of other measured physical properties.

What was the strongest relationship you found?

The strongest correlation I found was between total sulfur dioxide and fixed sulfur dioxide, but the correlation is a result of how the variable was created. After that fixed acidity and pH have the highest correlation (-0.68). Other similarly strong correlations include:

  • fixed acidity and citric acid
  • fixed acidity and density
  • free sulfur dioxide and total sulfur dioxide

However, the most interesting correlation for me is between alcohol and quality (0.48). High quality wines tend to have a lot of alcohol.

Multivariate Plots Section

Acidity vs. quality

Plotting the wines based on their acidity measures reveals two clusters of high-quality wines:

  • low fixed acidity and citric acid, high volatile acidity
  • low volatile acidity, high fixed acidity and citric acid

Although there is overlap, many low quality wines could already be identified from this plot: wines presented with red dots above the black line and grey dots below it are likely to be of poor quality (the location of the line is approximate and only for illustration).

Sulphates, sulfur dioxide and alcohol vs. quality

Good quality wines form a rather tight cluster. Again it is possible to identify many poor quality wines visually: any grey wine and all wines above the black line are likely to be of poor quality. Classification algorithms could probably do a good job at identifying high-quality wines (quality 7 or 8) using the following features:

  • alcohol
  • fixed acidity
  • volatile acidity
  • citric acid
  • fixed sulfur dioxide
  • sulphates

Models

## K nearest neighbors predictions:
##             
## prediction_1 Low High
##         Low  391   40
##         High  26   23
## [1] 0.86
## Support vector machine predictions:
##             
## prediction_2 Low High
##         Low  408   50
##         High   9   13
## [1] 0.88
## Random forest predictions:
##             
## prediction_3 Low High
##         Low  405   32
##         High  12   31
## [1] 0.91

Indeed, k nearest neighbors, support vector machine and random forest all work pretty well even without any optimization. In this case random forest has the best performance, achieving 91 % classification accuracy on the test set.
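The accuracy figures can be recomputed from the confusion matrices above. The sketch below also derives two numbers worth reading alongside accuracy: the recall for the High class, and the accuracy of always predicting "Low". It assumes rows are predictions and columns are actual classes, which fits the row labels and the identical column totals (417 Low, 63 High) across all three models. The helper make_cm() is made up for this example.

```r
# Confusion matrices copied from the output above:
# rows = predicted class, columns = actual class (assumed orientation)
make_cm <- function(v) {
  matrix(v, nrow = 2, byrow = TRUE,
         dimnames = list(predicted = c("Low", "High"),
                         actual    = c("Low", "High")))
}
cms <- list(
  knn           = make_cm(c(391, 40, 26, 23)),
  svm           = make_cm(c(408, 50,  9, 13)),
  random_forest = make_cm(c(405, 32, 12, 31))
)

# Accuracy: correctly classified wines / all wines
accuracy <- sapply(cms, function(cm) sum(diag(cm)) / sum(cm))

# Recall for the High class: correctly flagged good wines / all good wines
recall_high <- sapply(cms, function(cm) cm["High", "High"] / sum(cm[, "High"]))

# Accuracy of always predicting "Low" (the majority class)
baseline <- sum(cms$knn[, "Low"]) / sum(cms$knn)

print(round(rbind(accuracy, recall_high), 3))
print(round(baseline, 3))
```

Since 417 of the 480 test wines are of low quality, the majority-class baseline is already about 87 %, so the models are better compared on how many of the 63 high quality wines they find.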

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Combination of different acidity measures (fixed acidity, volatile acidity and citric acid) turned out to be useful in visually differentiating between high and low quality wines, as did the combination of fixed sulfur dioxide, sulphates and alcohol. The clustering looked much tighter than I had expected based on the bivariate comparisons. After seeing these plots it was not a surprise that classification models performed well at predicting wine quality.

Were there any interesting or surprising interactions between features?

The biggest surprise was the interaction between sulphates and fixed sulfur dioxide. Neither of them was a strong candidate as a predictor, but together they collected high-quality wines in a tight cluster.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I tried three classification models on the promising features identified during the exploratory analysis. All of them worked well “out of the box”, achieving around 90 % accuracy. In this case random forest had the best performance with 91 % accuracy on the test set. This means that based on six physical measures of the wine, the random forest model can correctly predict nine times out of ten whether the wine is of high quality. The performance of the models could likely be improved further with a little optimization. For instance, the k-value in the k-nearest neighbors model was pulled from a hat, and the other models were fitted with default parameters.
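The report does not show how the test set was made. Each confusion matrix sums to 480 wines, which is what a 70/30 split of 1599 rows would leave for testing, so the sketch below assumes that proportion. The seed, the variable names and the simulated feature are all hypothetical. It also shows the normalization step mentioned earlier in hindsight, which matters most for distance-based models such as k nearest neighbors.

```r
# Hypothetical 70/30 split; the seed and the proportion are assumptions
set.seed(42)
n_wines   <- 1599
train_idx <- sample(n_wines, size = floor(0.7 * n_wines))
test_idx  <- setdiff(seq_len(n_wines), train_idx)

# Simulated predictor standing in for one wine feature (not real data)
feature <- rnorm(n_wines, mean = 10.4, sd = 1)

# Scale with training-set statistics only, then apply them to the test set
mu    <- mean(feature[train_idx])
sigma <- sd(feature[train_idx])
train_scaled <- (feature[train_idx] - mu) / sigma
test_scaled  <- (feature[test_idx]  - mu) / sigma

print(c(train = length(train_idx), test = length(test_idx)))
```

Taking the mean and standard deviation from the training rows alone keeps information about the test set out of the fitted model.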


Final Plots and Summary

Plot One

Description One

Distribution of quality scores for red wines is bell-shaped. As many of the categories have relatively few observations, and the main interest is in differentiating good wines from the rest, it makes sense to combine categories. Only a minority of wines (217 of 1599, about 14 %) is of high quality.

Plot Two

Description Two

Amount of alcohol has the strongest association with wine quality. Better wines tend to have more alcohol.

Plot Three

Description Three

Three acidity measures and amounts of fixed sulfur dioxide, sulphates, and alcohol differentiate high-quality wines from low-quality wines rather well. Dashed lines help illustrate the borders of distinct clusters. In the upper plot, wines represented by red dots above the dashed line and by grey dots below it are likely to have low quality. In the lower plot wines represented with grey dots below the dashed line and all wines above the line are likely to have low quality.


Reflection

The dataset I explored contained physical measurements of 1599 red wines, along with subjective quality scores. I started by investigating the distributions of individual variables. After that I identified promising correlations between the variables in an effort to find a set of features that could be used to predict wine quality. Instead of the full quality scale I was only interested in differentiating good wines (score 7 or 8) from the rest. Initially the dataset felt confusing and it didn’t look like there were any interesting patterns, but systematically plotting comparisons of variables slowly revealed many interesting relationships. During the analysis the largest surprise was that some of the variables that did not look like very good predictors alone worked very well when combined. In the end I used the six identified features to build three classification models. Random forest performed best, achieving 91 % classification accuracy on the test set. The performance of the models could be improved by optimization and possibly by adding new features. Some of the less promising features could still contain useful information.

I made a couple of detours during the analysis by investigating the relationships of density and pH with the variables they were highly correlated with, and by looking into the interactions of the different acidity measures in detail. In the end I did not get anything useful out of this exploration, except perhaps the decision to leave density and pH out of further analysis, because the information they contain seemed likely to be already accounted for by other variables.